In this paper, we present a statistical-mechanical analysis of deep learning. We elucidate some
of the essential components of deep learning—pre-training by unsupervised learning and fine 
tuning by supervised learning. We formulate the extraction of features from the training data 
as a margin criterion in a high-dimensional feature-vector space. The self-organized classifier 
is then supplied with small amounts of labelled data, as in deep learning. Although we employ 
a simple single-layer perceptron model, rather than directly analyzing a multi-layer neural 
network, we find a nontrivial phase transition that is dependent on the number of unlabelled 
data in the generalization error of the resultant classifier. In this sense, we evaluate the efficacy 
of the unsupervised learning component of deep learning. The analysis is performed by the 
replica method, which is a sophisticated tool in statistical mechanics. We validate our result 
in the manner of deep learning, using a simple iterative algorithm to learn the weight vector 
on the basis of belief propagation. 


1. Introduction 

Deep learning is a promising technique in the field of machine learning, and its outstanding
performance in pattern recognition applications, in particular, has been extensively reported.
The aim of deep learning is to efficiently extract important structural information directly
from the training data to produce a high-precision classifier.1) The technique essentially
consists of three parts. First, a large number of hidden units are introduced by constructing a
multi-layer neural network, known as a deep neural network (DNN). This allows the
implementation of an iterative coarse-graining procedure, whereby each high-level layer of the
neural network extracts abstract information from the input data. In other words, we introduce
some redundancy for feature extraction and dimensional reduction (a kind of sparse
representation) of the given data. The second part is pre-training by unsupervised learning.
This is a kind of self-organization.2) To accomplish self-organization in the DNN, we provide
plenty of unlabelled data. The network learns the structure of the input data by tuning the
weight vectors (often termed the network parameters) assigned to each layer of the neural
network. Updating each weight vector by the gradient method, i.e., back propagation, takes a
relatively long time,3) even when combined with regularization by the $L_1$ norm and greedy
algorithms.4-6) This is because many local minima are found during the optimization of the DNN.
Instead, techniques such as the auto-encoder have been proposed to make the pre-training
more efficient and push up the basins of attraction of the minima via a better generalization
of the training data.7-9) The third component of deep learning involves fine tuning the weight
vectors using supervised learning to elaborate the DNN into a highly precise classifier. This
combination of unsupervised and supervised learning enables the architecture of deep learning
to obtain better generalization, effectively improving the classification under a
semi-supervised learning approach.10,11)

In the present study, we focus on the latter two parts of deep learning. The first is
neglected because it simply highlights a way of implementing the deep learning algorithm. A
recent study has formulated a theoretical basis for the relationship between the recursive
manipulation of variational renormalization groups and the multi-layer neural network in deep
learning.12) Indeed, it has been confirmed that the renormalization group can mitigate the
computational cost of learning without any significant degradation.13) Furthermore, the
direct evaluation of multi-layer neural networks is too complex at this early stage of our
theoretical understanding of deep learning. Although most DNNs are constructed from
Boltzmann machines with hidden units, we simplify the DNN to a basic perceptron. This
simplification, which is made purely for our analysis, enables us to shed light on the
fundamental origin of the outstanding performance of deep learning and the efficiency of
pre-training by unsupervised learning.

The steady performance of the classifier constructed by the deep learning algorithm can
be assessed in terms of the generalization error using a statistical-mechanical analysis based
on the replica method.14) We consequently find nontrivial behaviour involving the emergence
of a metastable state of the generalization error, a result of the combination of unsupervised
and supervised learning. This is analogous to the metastable state in classical spin
models, which leads to the hysteresis effect in magnetic fields. Following the actual process
of deep learning, we numerically test our result by successively implementing the unsupervised
learning of the pre-training procedure and the supervised learning for fine tuning. We
then demonstrate the effect of being trapped in the metastable state, which worsens the
generalization error. This justifies the need for fine tuning by several sets of labelled data
after the pre-training stage of deep learning.

The remainder of this paper is organized as follows. In the next section, we formulate our 
simplified model to represent unsupervised and supervised learning with structured data, and 
analyze the Bayesian inference process for the weight vectors. In Section 3, we investigate the 
nontrivial behaviour of the generalization error in our model. We demonstrate that the
generalization error can be significantly improved by the use of sufficient amounts of unlabelled
data. Finally, in Section 4, we summarize the present work. 


2. Analysis of combination of unsupervised and supervised learning 

2.1 Problem setting 

We deal with a simple two-class labelled-unlabelled classification problem. We assume
that the $N$-dimensional feature vectors $x_\mu \in \mathbb{R}^N$ obey the following distribution
function conditioned on the binary label $y_\mu = \pm 1$ for each datum $\mu$ and a predetermined
weight vector $w_0$:

$$P_g(x_\mu | y_\mu, w_0) \propto \Theta\left( \frac{y_\mu}{\sqrt{N}} x_\mu^{\rm T} w_0 - g \right), \tag{1}$$

where $g$ is a margin, which imitates the structure of the feature vectors of the given data,
and

$$\Theta(x) = \begin{cases} 1 & (x > 0) \\ 0 & (x < 0). \end{cases} \tag{2}$$

The labelled data $(x_\mu, y_\mu)$ ($\mu = 1, 2, \cdots, L$) are generated from the joint probability
$P_g(x_\mu | y_\mu, w_0) P(y_\mu)$, where $L$ is the number of labelled data. The unlabelled data
$x_\mu$ ($\mu = L+1, L+2, \cdots, L+U$), where $U$ is the number of unlabelled data, follow the
marginal probability $P_g(x_\mu | w_0) = \sum_{y_\mu} P_g(x_\mu | y_\mu, w_0) P(y_\mu)$. In the
following, we assume the large-$N$ limit and a huge number of data $L, U \sim O(N)$, as well as a
symmetric distribution for the label, $P(y_\mu) = 1/2$.

The likelihood function for the dataset is defined as

$$P_g(D | w_0) = \prod_{\mu=1}^{L} P_g(x_\mu | y_\mu, w_0) P(y_\mu) \prod_{\mu=L+1}^{L+U} P_g(x_\mu | w_0), \tag{3}$$

where $D$ denotes the dataset consisting of labelled data and unlabelled data. When the margin
$g$ of the feature vectors is zero, unsupervised learning is no longer meaningful, because
the marginal distribution becomes flat. However, nonzero values of the margin reveal the
structure of the feature vectors through unsupervised learning. Actual data such as images
and sounds have many inherent structures that must be represented by high-dimensional
weight vectors in the multi-layer neural networks of a DNN. In the present study, we simplify
this aspect of the actual data to give an artificial model with a margin that follows the simple
perceptron. This allows us to assess certain nontrivial aspects of deep learning.
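
As a concrete illustration of this problem setting, the following minimal sketch draws labelled and unlabelled data that respect the margin constraint of Eq. (1) by rejection sampling. It assumes standard Gaussian feature components before the constraint is imposed and a margin $g \ge 0$. The helper names `local_field`, `draw_example`, and `make_dataset` are hypothetical and are not part of the original formulation.

```python
import math
import random


def local_field(x, w):
    """Scaled projection x.w / sqrt(N) that enters Eq. (1)."""
    return sum(xi * wi for xi, wi in zip(x, w)) / math.sqrt(len(w))


def draw_example(w0, g, rng):
    """Rejection-sample one (x, y) pair with margin g >= 0 around w0."""
    while True:
        # Assumption: standard Gaussian features before the margin constraint.
        x = [rng.gauss(0.0, 1.0) for _ in range(len(w0))]
        u = local_field(x, w0)
        if abs(u) > g:  # sum over y = +1, -1 of Theta(y * u - g)
            return x, (1 if u > 0 else -1)


def make_dataset(w0, g, n_labelled, n_unlabelled, rng):
    labelled = [draw_example(w0, g, rng) for _ in range(n_labelled)]
    # Unlabelled data follow the marginal: same draw, label discarded.
    unlabelled = [draw_example(w0, g, rng)[0] for _ in range(n_unlabelled)]
    return labelled, unlabelled


rng = random.Random(0)
N, g = 20, 0.5
w0 = [rng.gauss(0.0, 1.0) for _ in range(N)]
scale = math.sqrt(N / sum(w * w for w in w0))
w0 = [w * scale for w in w0]  # enforce |w0|^2 = N
labelled, unlabelled = make_dataset(w0, g, n_labelled=5, n_unlabelled=10, rng=rng)
```

For $g = 0$ the acceptance test is always passed, so the unlabelled vectors carry no information about $w_0$; this is the flat marginal distribution mentioned above.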


2.2 Bayesian inference and replica method 

For readers unfamiliar with deep learning, we sketch its procedure here. The first step of
the deep learning algorithm is to conduct pre-training. Through unsupervised learning, the
weight vector learns the features of the training data without any labels. As a simple
strategy, we often estimate the weight vector by maximizing the likelihood function for the
unlabelled data only:

$$w_{\rm PT} = \arg\max_{w} \left\{ \log \prod_{\mu=L+1}^{L+U} P_h(x_\mu | w) \right\}. \tag{4}$$

We use a margin value $h$ that may differ from the $g$ in Eq. (3) in order to evaluate a generic
case below. When the structure of the data is known a priori, one may set $g = h$. We may utilize
hidden units to prepare some redundancy to represent the features of the given data. In the
present study, we omit this aspect to simplify the following analysis. In other words, we take
a coarse-grained picture of the DNN consisting of a single layer with a weight vector $w$, the
input $x_\mu$, and the output $y_\mu$. In the second step, termed the fine tuning step, we estimate
the weight vector to precisely classify the training data. For instance, maximum likelihood
estimation is a candidate for estimating the weight vector:

$$w_{\rm FT} = \arg\max_{w} \left\{ \log \prod_{\mu=1}^{L} P_h(x_\mu | y_\mu, w) P(y_\mu) \prod_{\mu=L+1}^{L+U} P_h(x_\mu | w) \right\}. \tag{5}$$

We note an important feature of the deep learning architecture. In this procedure, we use the
result of the pre-training $w_{\rm PT}$ as an initial condition for the gradient method to obtain
$w_{\rm FT}$. The purpose of deep learning is simply to obtain a weight vector that classifies
newly generated data with better performance through some strategy such as Eq. (5). The
computational cost of the often-employed methods (e.g., back propagation3)) generally becomes
extremely high. However, if we have an adequate initial condition for the estimation,
we can avoid wasteful computation and reach a better estimate of the weight vector.8,9)
In order to evaluate the theoretical limitation of deep learning, instead of maximum
likelihood estimation, we employ an optimal procedure based on the framework of Bayesian
inference. The posterior distribution is given by Bayes' formula as

$$P_h(w | D) = \frac{P_h(D | w) P(w)}{\int dw' \, P_h(D | w') P(w')}. \tag{6}$$

We assume that the prior distribution for the weight vector is $P(w) \propto \delta(|w|^2 - N)$.
The posterior mean given by this posterior distribution provides an estimator for a quantity
$f(w)$ related to the weight vector:

$$E_{w|D}[f(w)] = \int dw \, f(w) \frac{P_h(D | w) P(w)}{\int dw' \, P_h(D | w') P(w')}. \tag{7}$$

The typical value is evaluated by averaging over the randomness of the dataset as

$$\left[ E_{w|D}[f(w)] \right]_D = \left[ \int dw \, f(w) \frac{P_h(D | w) P(w)}{\int dw' \, P_h(D | w') P(w')} \right]_D, \tag{8}$$

where

$$[\cdots]_D = \int dD \, dw_0 \, P_g(D | w_0) P(w_0) \times \cdots. \tag{9}$$

The average quantity is given by the derivative of the characteristic function, namely the free
energy, which is defined as

$$-F = \lim_{N \to \infty} \frac{1}{N} \left[ \log \int dw \, P_h(D | w) P(w) \right]_D. \tag{10}$$

In particular, as shown below, the derivative of the free energy yields a set of self-consistent
equations for the physically relevant quantities. In this problem, we compute the overlap
between the estimated weight vector $w$ and the original weight vector $w_0$, and the variance of
the weight vectors, which quantify the precision of the learning. Following spin glass
theory,14) we apply the replica method to evaluate the free energy. We define the replicated
partition function as

$$\Xi_n = \left[ \left( \int dw \, P_h(D | w) P(w) \right)^n \right]_D. \tag{11}$$

The free energy (density) can be calculated from the replicated partition function through
the replica method as

$$-F = \lim_{n \to 0} \frac{\partial}{\partial n} \lim_{N \to \infty} \frac{1}{N} \log \Xi_n. \tag{12}$$
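
The identity behind the replica method, namely that the derivative of $\log [Z^n]$ with respect to $n$ at $n \to 0$ reproduces the quenched average $[\log Z]$, can be checked numerically on a toy ensemble. The sketch below is only an illustration under stated assumptions: the "partition function" is a log-normal random variable rather than the perceptron model, the limit $n \to 0$ is approximated by a small finite $n$, and the function name `replica_estimate` is hypothetical.

```python
import math
import random


def replica_estimate(log_z_samples, n=1e-3):
    """Finite-n approximation of d/dn log [Z^n] at n -> 0."""
    mean_zn = sum(math.exp(n * lz) for lz in log_z_samples) / len(log_z_samples)
    return math.log(mean_zn) / n


rng = random.Random(1)
# Toy ensemble (assumption): log Z is Gaussian with mean 2 and unit variance.
log_z = [rng.gauss(2.0, 1.0) for _ in range(20000)]

quenched = sum(log_z) / len(log_z)   # direct average of log Z
replica = replica_estimate(log_z)    # replica-style estimate
```

The two numbers differ by a term of order $n$ times the variance of $\log Z$, which vanishes as $n \to 0$.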

We exchange the order of the operations on $n$ and the thermodynamic limit $N \to \infty$, and
assume that the replica number $n$ is temporarily a natural number in the evaluation of $\Xi_n$.
We introduce the following constraints to simplify the calculation dependent on $w_a$:

$$1 = \int dQ \prod_{a>b} \delta\left( Q_{ab} - \frac{1}{N} w_a^{\rm T} w_b \right) \prod_{a=1}^{n} \delta\left( Q_{0a} - \frac{1}{N} w_0^{\rm T} w_a \right). \tag{13}$$

The free energy is then given by solving an extremization problem:

$$-F = \sup_{Q} \left[ \mathcal{G}(Q) - \mathcal{I}(Q) \right], \tag{14}$$

where

$$\mathcal{G}(Q) = \alpha \log \left[ \Theta(u_0 - g) \prod_{a=1}^{n} \Theta(u_a - h) \right]_u + \beta \log \left[ \Phi(u_0, g) \prod_{a=1}^{n} \Phi(u_a, h) \right]_u, \tag{15}$$

$$\mathcal{I}(Q) = \sup_{\tilde{Q}} \left\{ \sum_{a>b} \tilde{Q}_{ab} Q_{ab} + \sum_{a=1}^{n} \tilde{Q}_{0a} Q_{0a} - \log \mathcal{M}(\tilde{Q}) \right\}, \tag{16}$$

$$\mathcal{M}(\tilde{Q}) = {\rm E}\left[ \exp\left( \sum_{a>b} \tilde{Q}_{ab} w_a w_b + \sum_{a=1}^{n} \tilde{Q}_{0a} w_0 w_a \right) \right]. \tag{17}$$

Here, $\alpha = L/N$, $\beta = U/N$, and

$$\Phi(u, h) = \frac{1}{2} \Theta(u - h) + \frac{1}{2} \Theta(-u - h). \tag{18}$$

The expectation in Eq. (17) is taken over the distribution $\prod_{a=0}^{n} P(w_a)$. We introduce
auxiliary parameters $\tilde{Q}_{ab}$ to give an integral representation of the delta functions.
We use $[\cdots]_u$ to denote the average with respect to the $(n+1)$-multivariate Gaussian random
variables $\{u_a\}$ with vanishing mean and covariance $[u_a u_b]_u = \delta_{ab} + Q_{ab}(1 - \delta_{ab})$.

2.3 Replica-symmetric solution 

Let us evaluate the replica-symmetric (RS) solution by imposing invariance of $Q_{ab}$ and
$\tilde{Q}_{ab}$ under permutation of the replica indices:

$$Q_{aa} = 1, \quad Q_{ab} = q, \quad Q_{0a} = m, \qquad \tilde{Q}_{aa} = \tilde{Q}, \quad \tilde{Q}_{ab} = \tilde{q}, \quad \tilde{Q}_{0a} = \tilde{m}. \tag{19}$$

Then, the Gaussian random variables can be written as $u_a = \sqrt{q} z + \sqrt{1 - q} \, t_a$ for
$a > 0$ and $u_0 = \sqrt{m^2/q} \, z + \sqrt{1 - m^2/q} \, t_0$ using the auxiliary Gaussian random
variables $\{t_a\}$ and $z$ with vanishing mean and unit variance. Under the RS assumption, we
obtain an explicit form for the free energy by solving the saddle-point equations for
$\tilde{Q}$, $\tilde{q}$, and $\tilde{m}$:


-T 


f DzH 
J
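
The replica-symmetric decomposition of the Gaussian variables can be verified by direct sampling. The following minimal sketch draws $u_0$ and two replicas $u_1, u_2$ from the stated decomposition and estimates their second moments, which should approach $[u_0 u_a]_u = m$, $[u_a u_b]_u = q$, and $[u_a^2]_u = 1$. The values $m = 0.3$ and $q = 0.5$ are arbitrary illustrative choices satisfying $0 \le m$ and $m^2 \le q < 1$, and the function name `sample_rs_fields` is hypothetical.

```python
import math
import random


def sample_rs_fields(m, q, n_replicas, rng):
    """Draw (u_0, [u_1..u_n]) from the RS decomposition; needs m*m <= q < 1."""
    z = rng.gauss(0.0, 1.0)  # common Gaussian shared by all replicas
    u0 = math.sqrt(m * m / q) * z + math.sqrt(1.0 - m * m / q) * rng.gauss(0.0, 1.0)
    us = [math.sqrt(q) * z + math.sqrt(1.0 - q) * rng.gauss(0.0, 1.0)
          for _ in range(n_replicas)]
    return u0, us


rng = random.Random(2)
m, q, n_samples = 0.3, 0.5, 100000
draws = [sample_rs_fields(m, q, 2, rng) for _ in range(n_samples)]

cov_0a = sum(u0 * us[0] for u0, us in draws) / n_samples     # expected: m
cov_ab = sum(us[0] * us[1] for _, us in draws) / n_samples   # expected: q
var_a = sum(us[0] ** 2 for _, us in draws) / n_samples       # expected: 1
```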
The partial derivatives of the free energy (20) with respect to $m$ and $q$ lead to the
saddle-point equations for the physically relevant RS order parameters, namely the overlap $m$
and the variance $q$ of the weight vector:


a 



mz + yfqg 
yjq - m 2 


\ 


7 


H' 



\ 


h\^\ 

V \ yi -q] ) 


+/3 J " DzG' g (m, yfq) 
Gh( yfq, 1)
i -q
G h (yjq,l)) (1 -q)


2’ 


( 22 ) 


(23) 


where $H'(x) = -\exp(-x^2/2)/\sqrt{2\pi}$ and


G' h (a,b) 




az + bh 
y/b 2 - a 2 


H' 


az - bh 
y/b 2 - a 2 


(24) 


The RS solution always satisfies $q = m$ under the condition $g = h$ (the Bayes-optimal
solution). The above saddle-point equations are then reduced to the following single equation
for $q$:


a 


f 


Dz 


H' 


y/qz+h 






+/3 Dz 




Gh( yjq, 1 ) 1 - q 


(25) 


The order parameter $q$ is closely related to the generalization error, which is defined as the
probability that the classifier output disagrees with the true label for a newly generated
example after the classifier has been trained. In the case of an input-output relation
given by a simple perceptron, the generalization error is expressed as14)

$$\epsilon = \frac{1}{\pi} \cos^{-1} q. \tag{26}$$

We will evaluate this quantity to validate the performance of the classifier generated from the 
combination of unsupervised and supervised learning. 
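
The arccosine form of the generalization error can be checked by a small Monte Carlo experiment. The sketch below assumes that the local fields of the teacher and the trained classifier are jointly Gaussian with unit variance and a given correlation (the argument `overlap`), and it counts how often their signs disagree. The function names are hypothetical, and the value 0.6 is an arbitrary illustrative overlap.

```python
import math
import random


def generalization_error(overlap):
    """Arccosine formula for two perceptrons with the given overlap."""
    return math.acos(overlap) / math.pi


def disagreement_rate(overlap, n_samples, rng):
    """Fraction of correlated Gaussian field pairs with opposite signs."""
    count = 0
    for _ in range(n_samples):
        u = rng.gauss(0.0, 1.0)  # teacher field
        v = overlap * u + math.sqrt(1.0 - overlap ** 2) * rng.gauss(0.0, 1.0)
        if u * v < 0:
            count += 1
    return count / n_samples


theory = generalization_error(0.6)
estimate = disagreement_rate(0.6, 100000, random.Random(3))
```

An overlap of 1 gives zero error, while an overlap of 0 gives an error of 1/2, i.e., random guessing.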

































J. Phys. Soc. Jpn. 



Fig. 1. (Color online) Generalization errors as functions of $\beta$ for $h = 0.1, 0.05, 0.03, 0.02$,
and $0.01$ (curves from left to right). The left panel shows the results for $\alpha = 1$, and the
right one for $\alpha = 10$. Both cases exhibit multiple solutions for the same value of $\beta$.


3. Saddle point and numerical verification 

In Fig. 1, we plot the logarithm of the generalization error with respect to the number of
unlabelled data, $\beta$, for several values of $h$. Each panel shows the results for a different
value of $\alpha$. Note that when there is no fine tuning through supervised learning (i.e.,
$\alpha = 0$), the generalization error does not exhibit any nontrivial behaviour. However, for
nonzero $\alpha$, we find nontrivial curves, which give multiple solutions for the same $\beta$, in
the $\beta$-$\epsilon$ plane. This is a remarkable result for the combination of unsupervised and
supervised learning. The nontrivial curves imply the existence of a metastable state, similar
to several classical spin models.15) As $h$ decreases, the spinodal point $\beta_{\rm sp}$ (the point
at which the multiple solutions coalesce) moves to larger values of $\beta$. This is because
decreasing $h$ makes the classification of the input data more difficult. In other words, we
need a vast number of unlabelled data to attain the lower-error state for a fixed number of
labelled data. However, the metastable state persists up to a large value of $\beta$, which makes
the computation very expensive: we need either an extremely long computational time to reach
the lower-error solution or good initial conditions nearby. On the other hand, increasing
$\alpha$ causes the spinodal points to move to lower values of $\beta$. Although this confirms an
improvement in the generalization error for the higher-error state, there is no quantitative
change in that for the lower-error state. In this sense, pre-training is an essential part of
the architecture of deep learning if we wish to achieve the lower-error state; this is the
origin of deep learning's remarkable performance. In contrast, the emergence of the metastable
state causes the computational cost to increase drastically. Several special techniques could
be incorporated into the architecture of deep learning to avoid this weak point, effectively
preparing good initial conditions that enable the lower-error state to be reached.8,9)
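
The role of the initial condition in the presence of a metastable state can be illustrated with the classical spin analogy mentioned above. The following sketch is not the saddle-point equation of the present model; it is a toy mean-field Ising equation, $m = \tanh[\beta_{\rm I}(m + H_{\rm ext})]$, chosen as an assumption because it shows the same mechanism. The inverse temperature `beta` $= 1.5$, the field values, and the function name `iterate_magnetization` are all hypothetical choices made for illustration.

```python
import math


def iterate_magnetization(field, beta=1.5, m0=-1.0, n_iter=10000):
    """Fixed-point iteration of the toy equation m = tanh(beta * (m + field))."""
    m = m0
    for _ in range(n_iter):
        m = math.tanh(beta * (m + field))
    return m


# Weak field: the iteration started at m0 = -1 stays in the metastable branch.
m_meta = iterate_magnetization(0.05, m0=-1.0)
# Same field, good initial condition: the stable branch is reached.
m_stable = iterate_magnetization(0.05, m0=1.0)
# Field beyond the spinodal point: the metastable branch has disappeared.
m_escaped = iterate_magnetization(0.2, m0=-1.0)
```

In this analogy the field plays the role of the amount of data: below the spinodal point an iterative solver needs a good initial condition to reach the better solution, whereas beyond it any initial condition suffices.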

The asymptotic form $H(x) \sim \Theta(x) \exp(-x^2/2)/|x|$ for $x \to \infty$ leads to the exponent
of the learning curve,14) which characterizes the decrease in the generalization error for
$\alpha \gg 1$ and $\beta \gg 1$ as $\epsilon_g \sim (c_\alpha \alpha^2 + c \, \alpha\beta + c_\beta \beta^2)^{-1}$.
Here, $c_\alpha$, $c_\beta$, and $c$ are constants evaluated by Gaussian integrals. Thus, there is
no quantitative change in the exponent of the learning curve
in this formulation compared with that of the perceptron with ordinary supervised learning.
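
The quality of this asymptotic form can be examined numerically. The sketch below assumes the standard definition $H(x) = \int_x^\infty Dz$ of the Gaussian tail, written with the complementary error function, and it includes the constant prefactor $1/\sqrt{2\pi}$ that the asymptotic relation above suppresses. The function names are hypothetical.

```python
import math


def H(x):
    """Gaussian tail probability, H(x) = integral of Dz from x to infinity."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def H_asymptotic(x):
    """Leading large-x behaviour exp(-x^2/2) / (sqrt(2*pi) * x), for x > 0."""
    return math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * x)


# The ratio approaches 1 from below as x grows.
ratios = {x: H(x) / H_asymptotic(x) for x in (2.0, 4.0, 8.0)}
```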

Next, let us consider the effect of fine tuning in the context of deep learning. If we plot
the saddle-point solutions in the $\alpha$-$\epsilon$ plane, we find that multiple solutions appear
in a certain region. Increasing the number of unlabelled data again leads to an improvement in
the generalization error. A gradual increase in the number of labelled data allows us to escape
from the metastable state. In this sense, fine tuning by supervised learning is necessary to
achieve the lower-error state and mitigate the difficulties in reaching the desired solution. We
should emphasize that the emergence of the metastable state does not come from the multi-layer
neural networks in the DNN, but from the combination of unsupervised and supervised
learning. This observation was also noted in a previous study.16)

To verify our analysis, we conduct numerical experiments using the so-called approximate
message passing algorithm.17) Following a modern formulation of this algorithm,18)
we can construct an iterative algorithm to infer the weight vector using both the unlabelled
and labelled data. The update equations are

<' = Yj x » kWk ~\t c \ at n^A 

k= 1 ' ' 


where 


C^a, b, h) 



exp(-£/2) 

y " V2 nbH(z-) 

exp(-£/2)-exp(-4/2) 
y/2nb(H(z~) + H(z + )) 


(M<L) 


(28) 

(29) 

(30) 


(31) 
Dfj(a, b, h)



(ju>L). 




(32) 


Here $x_{\mu k}$ is the $k$th component of the feature vector of the datum $\mu$ and $w_k$ is the
$k$th component of the weight vector. We use the abbreviation $z_\pm = (h \pm a)/\sqrt{b}$, and
estimate the weight vector from w = a'/V. In the numerical experiments, we first estimate the
weight vector using only the unlabelled data, i.e., $\alpha = 0$. We then gradually increase the
number of labelled data while estimating the weight vector. The system size is set to
$N = 100$, and the number of samples to $N_{\rm sam} = 1000$. The maximum iteration number for
fine tuning is set to 20. In Fig. 2, we plot the generalization error averaged over $N_{\rm sam}$
independent runs starting from randomized initial conditions. As theoretically predicted, our
results confirm the water-falling phenomena for several cases with h = 0.5. Increasing the
number of labelled data in the fine tuning step allows us to escape from the metastable state.
Therefore, fine tuning is a necessary component in the remarkable performance of deep
learning. However, the difficulty of classification, represented by $h$, demands a large number
of training data. Therefore, we require the initial condition to be as good as possible in the
fine tuning to reach the lower-error state. Several empirical studies of the deep learning
algorithm have revealed that special techniques such as the auto-encoder can provide initial
conditions that are sufficiently good to improve the performance after fine tuning.9) In
future work, we intend to clarify whether such specific techniques
do indeed overcome the degradation in performance caused by the metastable state.
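
To make the role of the abbreviation $z_\pm = (h \pm a)/\sqrt{b}$ concrete, the following minimal sketch evaluates an output function for a single datum as the derivative of its log-evidence with respect to the local field $a$. This form is an assumption made for illustration: for a labelled datum (with the label absorbed into the sign of $a$) the evidence is taken as $H(z_-)$, and for an unlabelled datum as $H(z_-) + H(z_+)$. The names `log_evidence`, `output_function`, and `finite_difference` are hypothetical, and the block is not a full implementation of the message passing iteration.

```python
import math


def H(x):
    """Gaussian tail probability."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def gauss_density(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def log_evidence(a, b, h, labelled):
    """Log-probability that a Gaussian field (mean a, variance b) clears margin h."""
    z_minus, z_plus = (h - a) / math.sqrt(b), (h + a) / math.sqrt(b)
    if labelled:
        return math.log(H(z_minus))
    return math.log(H(z_minus) + H(z_plus))


def output_function(a, b, h, labelled):
    """Analytic derivative of log_evidence with respect to a."""
    z_minus, z_plus = (h - a) / math.sqrt(b), (h + a) / math.sqrt(b)
    if labelled:
        return gauss_density(z_minus) / (math.sqrt(b) * H(z_minus))
    return (gauss_density(z_minus) - gauss_density(z_plus)) / (
        math.sqrt(b) * (H(z_minus) + H(z_plus)))


def finite_difference(a, b, h, labelled, eps=1e-6):
    """Central-difference check of the analytic derivative."""
    return (log_evidence(a + eps, b, h, labelled)
            - log_evidence(a - eps, b, h, labelled)) / (2.0 * eps)
```

For an unlabelled datum the function vanishes at $a = 0$ by symmetry, which reflects the fact that unlabelled data alone cannot fix the sign of the weight vector.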

4. Conclusion 

We have analyzed the simplified perceptron model under a combination of unsupervised
and supervised learning for data with a margin. The margin imitates the structure of the
training data. We have found nontrivial behaviour in the generalization error of the classifier
obtained by this hybrid of unsupervised and supervised learning. First, we confirmed the
remarkable improvement in the generalization error by increasing the number of unlabelled
data. In this sense, the pre-training step in deep learning is essential when few labelled data
are available. In addition, our result reveals the existence of the metastable solution, which
hampers the ordinary gradient-based iteration in pursuing the optimal estimation. In the deep
learning algorithm, the pre-training technique is crucial in reducing the computation time
and attaining good performance, because good initial conditions allow the algorithm to reach
the lower-error state. Instead of focusing on the specialized pre-training technique, we have
investigated the nontrivial behaviour involved in the metastable state and the existence of the
lower-error state, which is used in deep learning. In addition, we have analyzed the role
of fine tuning by changing the number of labelled data. This also confirmed the nontrivial
behaviour in the generalization error. Our numerical experiments demonstrated the
water-falling phenomena involved in the existence of the metastable state and confirmed that
after fine tuning we reach the lower-error state.

Fig. 2. (Color online) Numerical test using approximate message passing. We illustrate the case with $h = 0.05$
for $\beta = 100$ (blue) and $\beta = 200$ (red). Error bars are shown for each plot over $N_{\rm sam} = 1000$ samples.

We make a remark on the statistical-mechanical analysis for a similar problem setting,
namely that of semi-supervised learning. A previous analysis also revealed the existence of
the metastable state.16) The present study suggests that the metastable state is essential in the
combination of unsupervised and supervised learning. In this sense, to develop deep learning
further and perform it efficiently, we should devise techniques to
escape from the metastable state.

Our present work is one instance in which a simplified model can demonstrate the essence
of deep learning and clarify certain theoretical aspects. We hope that future studies will
“extract the features” of the architecture of deep learning.